Ordered compaction and inlining with full catalog integration #642

Alex-Monahan · 2025-12-22T19:00:06Z

Hi folks!

This is a new PR meant to solve the same use case as #593, but addressing the PR feedback! Thank you for the guidance - it was super helpful. I am still open to any changes you recommend!

This adds 2 new tables to the DuckLake spec: ducklake_sort_info and ducklake_sort_expression.

ducklake_sort_info keeps a version history of the sort settings for tables over time. It has 1 row for each time that a table has a new sort setting applied. (The prior PR used an option for this, but that has been removed based on the feedback.)

ducklake_sort_expression tracks the details of that sort. For each time a new sort setting is applied, this table includes one row for each expression in the order by. (If I order by column3 asc, column42 desc, column64 asc, then there will be 3 rows.)
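To make the row counts concrete, here is a minimal Python sketch of the two tables as plain in-memory records. It is only an illustration, not the DuckLake implementation: the column names are borrowed from the query output later in this thread, while sort_id, the split of columns between the two tables, and the way end_snapshot is closed when a new setting arrives are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SortInfo:
    """One row per sort setting applied to a table (ducklake_sort_info)."""
    sort_id: int                 # hypothetical key linking the two tables
    table_id: int
    begin_snapshot: int
    end_snapshot: Optional[int]  # None while this setting is the current one


@dataclass
class SortExpression:
    """One row per ORDER BY term of a setting (ducklake_sort_expression)."""
    sort_id: int
    sort_key_index: int
    expression: str
    sort_direction: str
    null_order: str


def set_sorted_by(infos, exprs, table_id, snapshot, order_by):
    """Close the table's current setting (assumed behavior) and record a new one."""
    for info in infos:
        if info.table_id == table_id and info.end_snapshot is None:
            info.end_snapshot = snapshot
    sort_id = len(infos) + 1
    infos.append(SortInfo(sort_id, table_id, snapshot, None))
    for index, (column, direction, nulls) in enumerate(order_by):
        exprs.append(SortExpression(sort_id, index, column, direction, nulls))


def current_sort(infos, exprs, table_id):
    """Return the ORDER BY terms of the table's current setting, in key order."""
    current = [i.sort_id for i in infos
               if i.table_id == table_id and i.end_snapshot is None]
    rows = [e for e in exprs if e.sort_id in current]
    return sorted(rows, key=lambda e: e.sort_key_index)


infos: List[SortInfo] = []
exprs: List[SortExpression] = []

# ORDER BY column3 ASC, column42 DESC, column64 ASC: 1 info row, 3 expression rows
set_sorted_by(infos, exprs, 1, 5, [("column3", "ASC", "NULLS_LAST"),
                                   ("column42", "DESC", "NULLS_LAST"),
                                   ("column64", "ASC", "NULLS_LAST")])
# A later setting adds a second info row and one more expression row
set_sorted_by(infos, exprs, 1, 8, [("column42", "DESC", "NULLS_LAST")])
```

Only the setting whose end_snapshot is still empty is treated as current here, which mirrors the intent that compaction uses the most recent sort setting.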

There are still a few limitations with this PR:

1. This does not order during insert (only during compaction and inline flush).
   - I would love to try to do a follow up PR to add this!
2. Only explicit column names can be used in the sorting, not expressions.
   - I tried to implement this but could not, so I could use some help with this piece!
   - There is a friendly error message (and tests) to document this limitation.
   - The spec has a column named expression, so the intention was to make the spec itself forwards-compatible with expression-oriented sorting.
3. Files are still selected for compaction based on insertion order. It could be better to sort the list of files by min/max metadata before selecting files for compaction.
   - Let me know if this is desirable and I can work on it after the "order during insert" in number 1!
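As a rough illustration of the third limitation, the sketch below orders candidate files by their min/max statistics instead of insertion order. This is not part of the PR: the DataFile record, its field names, and the choice of (min, max) as the ordering key are all hypothetical.

```python
from typing import List, NamedTuple


class DataFile(NamedTuple):
    """Hypothetical stand-in for a data file with min/max stats on the sort key."""
    path: str
    min_key: int
    max_key: int


def order_for_compaction(files: List[DataFile]) -> List[DataFile]:
    """Order candidates by (min, max) so files with neighboring key ranges sit together."""
    return sorted(files, key=lambda f: (f.min_key, f.max_key))


# Files listed in insertion order, which does not match their key ranges
files = [
    DataFile("c.parquet", 200, 299),
    DataFile("a.parquet", 0, 99),
    DataFile("b.parquet", 100, 199),
]
ordered = [f.path for f in order_for_compaction(files)]
```

Merging adjacent files from a list ordered this way would tend to combine files with neighboring key ranges, which is the motivation given above.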

I believe that I made this fully compatible with the batching code, but I was testing locally on a DuckDB catalog and not on Postgres. Any extra eyes on that side would be great!

If this looks good, I can also do any docs PRs that you recommend - happy to help there.

Thanks folks! CC @philippmd as well as an FYI


pdet · 2026-01-13T10:08:57Z

Hey Alex, the PR is looking super solid, thanks again for all the effort you put in!
I'm wondering if you could also add the following tests:

1. What happens if you drop a table with an existing order, and then create a table with the same name but different columns?

2. Whether the changes information from snapshots() is correct.

3. Whether the alter statement interferes with compaction (see "Change schema tracking from global to table", #671).

4. Whether the new Parquet files are written correctly, verified directly (without accessing them through the table), and whether different files have the correct orders when there are different orders over multiple snapshots.

5. More transaction tests. For example, what happens if we begin a transaction and, in that same transaction, add a new sorting order and alter a column name?

6. What happens if you do a batch insertion on a table with a set column order (not inlined)?

The test files are pretty detailed, but my bio LLM context window is small. Do you mind also moving them to a sorted_table folder and modularizing them a bit more?
I also guess that, in general, we can use StringUtil::Lower() for catalog access when the result can be case insensitive.

> Thanks as always for the feedback! These tests make sense and I'm working on them! I had a few follow up questions.
>
> On 3, would you like for me to validate that an alter table statement on a different table does not interfere? (Ahead of time or in the same transaction?) Based on #684, I was assuming that any alter table on the same table would be a hard line that prevents compaction - please let me know if I am wrong!

Basically, we should check that the alter statement does not increase the schema counter of a table. Since the alter statement here does not change the actual schema, and the sorting is somewhat "optional", I believe that files should still be compactable before and after the alter.

> On 4, let me try to repeat back to you what I think you are looking for! I think you want to test the situation where I insert into the same table 3 times, each time with a different sort order setting, and then I run 1 compaction on that table. My goal was to only consider the most recent ducklake_sort_expression setting, regardless of the value of the setting at insert time, so this test would ensure that! I could also test multiple compactions and make sure that the final compaction resorts to the latest sort setting.

I wanted to see if files would always be generated with the latest sorting (e.g., by flushing), and also compacted with the latest sorting. Now that I think about it again, it may be overkill because you already have similar tests, so let's skip it.

> On 6, I'm not sure exactly how to proceed - I could use your help! This PR does not include the capability to do the sorting during an insert, so I'm not sure how the features would intersect just yet. Does adjusting the column order during insert result in the parquet file having the columns stored in a different order? I'm not sure what behavior I should be exercising with the test. Just let me know!

Let's skip this.

Alex-Monahan · 2026-01-14T15:09:31Z

When investigating 3, I found that you are correct that I am currently incrementing the schema_version when altering the sort order. Dang! I think there are a few other cases where we probably want to have the same "alter table, but don't change the schema_version". I'm thinking:

SET_SORT_KEY
SET_COMMENT
SET_COLUMN_COMMENT

I'll start trying to implement that behavior, but let me know if I need to course correct! From my first prototype, it requires some additional complexity in a few spots (I think we will need a different construct than new_tables for storing those kinds of alters), so I'm open to feedback and ideas!

pdet · 2026-01-15T07:29:49Z

> When investigating 3, I found that you are correct that I am currently incrementing the schema_version when altering the sort order. Dang! I think there are a few other cases where we probably want to have the same "alter table, but don't change the schema_version". I'm thinking:
>
> SET_SORT_KEY
>
> SET_COMMENT
>
> SET_COLUMN_COMMENT
>
> I'll start trying to implement that behavior, but let me know if I need to course correct! From my first prototype, it requires some additional complexity in a few spots (I think we will need a different construct than new_tables for storing those kinds of alters), so I'm open to feedback and ideas!

Yes, I think none of those should impact the schema ranges for compaction. I think the main thing you have to do is to ensure that these statements won't write a new entry into ducklake_schema_versions, as this table keeps track of schema changes per ducklake table.
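The rule being agreed on here can be sketched as a small decision function. This is a toy Python model, not the C++ code in the PR: the three non-schema alter kinds come from the list above, while ADD_COLUMN and RENAME_COLUMN are hypothetical labels standing in for alters that really change the schema.

```python
# Alter kinds that, per the discussion above, should not create a new schema version
NON_SCHEMA_ALTERS = frozenset({"SET_SORT_KEY", "SET_COMMENT", "SET_COLUMN_COMMENT"})


def next_schema_version(current: int, alter_kind: str) -> int:
    """Return the schema_version a table would have after applying one alter."""
    if alter_kind in NON_SCHEMA_ALTERS:
        return current       # metadata-only change: keep the version
    return current + 1       # real schema change: bump the version


def replay(alters, start=1):
    """Apply a sequence of alters and collect the version after each one."""
    versions = []
    version = start
    for kind in alters:
        version = next_schema_version(version, kind)
        versions.append(version)
    return versions


# Adding a column bumps the version; setting the sort key or a comment does not
history = replay(["ADD_COLUMN", "SET_SORT_KEY", "SET_COMMENT"])
```

Because the version stays the same across the metadata-only alters, files written before and after them would still fall in the same schema range for compaction.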

Alex-Monahan · 2026-01-15T15:00:06Z

Yup, I will prevent that! Unfortunately, I'm also finding that I need to update the local schema cache (since usually a new schema version would be handled differently). I'm making progress!


Alex-Monahan · 2026-01-16T03:44:59Z

I now have the schema_version not incrementing on SET_SORT_KEY, SET_COMMENT, and SET_COLUMN_COMMENT!

I split the tests apart and also added tests for 1, 2, 3, and 5! I added quite a lot of tests to ensure transactionality for 5. Since an alter table can't happen in the same transaction as a compaction (it hits the deliberate "Transactions can either make changes OR perform compaction - not both" error), there are fewer cases for compaction than for inlining.

I also added tests for the schema_version non-incrementing for the comment cases.

Let me know what you think!

Alex-Monahan · 2026-01-16T03:57:27Z

So, to fix the assertion issues on my fork's CI, I had to relax an assertion. Please let me know if I am off base, but I think that the assertion was too tight.

In src/storage/ducklake_catalog.cpp lines 439-461, the table_entry_map can have views added to it. However, there was an assertion in DuckLakeCatalogSet::GetEntryById(TableIndex index) that required a table (and not a view). Allowing a view there appears to solve things.

Alex-Monahan · 2026-01-18T21:32:04Z

As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?

pdet · 2026-01-19T10:09:40Z

> So, to fix the assertion issues on my fork's CI, I had to relax an assertion. Please let me know if I am off base, but I think that the assertion was too tight.
>
> In src/storage/ducklake_catalog.cpp lines 439-461, the table_entry_map can have views added to it. However, there was an assertion in DuckLakeCatalogSet::GetEntryById(TableIndex index) that required a table (and not a view). Allowing a view there appears to solve things.

Could you point me to the test that broke this requirement?

From what I can tell, there are other parts of the code that require this to return a table, as we dereference it to a DuckLakeTableEntry

e.g.,

unique_ptr<DuckLakeStats> DuckLakeCatalog::ConstructStatsMap(vector<DuckLakeGlobalStatsInfo> &global_stats,
                                                             DuckLakeCatalogSet &schema) {
	auto lake_stats = make_uniq<DuckLakeStats>();
	for (auto &stats : global_stats) {
		// find the referenced table entry
		auto table_entry = schema.GetEntryById(stats.table_id);

pdet · 2026-01-19T10:12:46Z

> As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?

Maybe this is something we can test with concurrentloop? (see test/sql/snapshot_info/ducklake_last_commit.test)

Alex-Monahan · 2026-01-19T13:50:37Z

> > So, to fix the assertion issues on my fork's CI, I had to relax an assertion. Please let me know if I am off base, but I think that the assertion was too tight.
> >
> > In src/storage/ducklake_catalog.cpp lines 439-461, the table_entry_map can have views added to it. However, there was an assertion in DuckLakeCatalogSet::GetEntryById(TableIndex index) that required a table (and not a view). Allowing a view there appears to solve things.
>
> Could you point me to the test that broke this requirement?
>
> From what I can tell, there are other parts of the code that require this to return a table, as we dereference it to a DuckLakeTableEntry
>
> e.g.,
> unique_ptr<DuckLakeStats> DuckLakeCatalog::ConstructStatsMap(vector<DuckLakeGlobalStatsInfo> &global_stats,
>                                                              DuckLakeCatalogSet &schema) {
> 	auto lake_stats = make_uniq<DuckLakeStats>();
> 	for (auto &stats : global_stats) {
> 		// find the referenced table entry
> 		auto table_entry = schema.GetEntryById(stats.table_id);

Sure! The tests that broke are here in this CI run on my fork. They were all running a query like

COMMENT ON VIEW ducklake.comment_view IS 'con1';

I can create a view-specific GetEntryById function if that would be better!

Alex-Monahan · 2026-01-19T17:36:22Z

> > As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
>
> Maybe this is something we can test with concurrentloop? (see test/sql/snapshot_info/ducklake_last_commit.test)

Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?

pdet · 2026-01-19T19:15:18Z

> > > As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
> >
> > Maybe this is something we can test with concurrentloop? (see test/sql/snapshot_info/ducklake_last_commit.test)
>
> Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?

Could we achieve this with multiple connections then? That's also possible within sqltests.

Alex-Monahan · 2026-01-20T02:19:29Z

> > > > As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
> > >
> > > Maybe this is something we can test with concurrentloop? (see test/sql/snapshot_info/ducklake_last_commit.test)
> >
> > Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?
>
> Could we achieve this with multiple connections then? That's also possible within sqltests.

I am not sure! If you have a spot where I can find an example, I can give it a shot.

To understand the behavior, I made a Python script that kicks off 2 separate CLI processes. I found that the sort is ignored by the other process if the schema was already cached ahead of time. The good news is that the catalog DB itself continues to have the right values, but the cache does not get invalidated correctly.

The flow is:

1. Process 1: connects, creates the table and inserts into it
2. Process 2: connects and runs an ALTER TABLE ADD COLUMN, which caches the catalog
3. Process 1: ALTER TABLE SET SORTED BY
4. Process 1: Completes / exits
5. Process 2: Compacts (using the cached catalog)
6. Process 2: Pulls updated table (which does not show the right order, since the cached catalog was used)
7. Process 2: Completes / exits

If I omit the ALTER TABLE ADD COLUMN step in process 2, then there is no issue and the sort occurs correctly.
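The flow above can be reproduced with a toy Python model. This is only a sketch under stated assumptions, not DuckLake's actual cache: it assumes each process caches catalog contents keyed by schema_version, and the Catalog and CachingClient classes and their methods are hypothetical names invented for the illustration.

```python
class Catalog:
    """Stands in for the shared metadata database that both processes attach to."""

    def __init__(self):
        self.schema_version = 1
        self.sort_keys = []  # empty list means unsorted


class CachingClient:
    """One process with its own local cache, keyed by schema_version (assumption)."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.cache = {}  # schema_version -> sort keys seen when that version was loaded

    def load_sort_keys(self):
        version = self.catalog.schema_version
        if version not in self.cache:
            self.cache[version] = list(self.catalog.sort_keys)
        return self.cache[version]

    def add_column(self):
        # A real schema change: bump the version, then cache the new catalog state
        self.catalog.schema_version += 1
        self.load_sort_keys()

    def set_sorted_by(self, keys, bump_version):
        self.catalog.sort_keys = list(keys)
        if bump_version:
            self.catalog.schema_version += 1


def run(bump_version, p2_caches_first=True):
    """Replay the flow and return the sort keys process 2 would compact with."""
    catalog = Catalog()
    p1, p2 = CachingClient(catalog), CachingClient(catalog)
    if p2_caches_first:
        p2.add_column()
    p1.set_sorted_by(["sort_key_1 DESC", "sort_key_2 DESC"], bump_version)
    return p2.load_sort_keys()


stale = run(bump_version=False)                            # cached entry hides the new sort
fresh = run(bump_version=True)                             # version bump forces a reload
no_cache = run(bump_version=False, p2_caches_first=False)  # nothing cached yet, so no issue
```

In this model the catalog itself always holds the right sort keys; only the second client's cached entry for the unchanged schema_version is stale, which matches the behavior described above.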

What do you recommend I do? I've thought about 3 options, but option 3 would need some help!

1. Keep the schema_version from incrementing, but accept this concurrency behavior.
2. Allow the schema_version to increment, but accept that a compaction barrier gets put in when the sort is changed.
3. Keep the schema_version from incrementing, but find some other way to correctly invalidate the cache or use a different key for the cache.

ducklake_set_sorted_multiprocess_add_column.py

The logs get printed out for process 1 before printing out process 2, but I logged out some timestamps to show the true order:

uv run ./test/ducklake_set_sorted_multiprocess_add_column.py
┌─────────┬──────────────────────┐
│ process │     sum("range")     │
│ varchar │        int128        │
├─────────┼──────────────────────┤
│ sql_1   │ 12499999997500000000 │
└─────────┴──────────────────────┘
┌─────────┬─────────────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_1 finished SET SORTED BY') │
│ varchar │                           varchar                           │
├─────────┼─────────────────────────────────────────────────────────────┤
│ sql_1   │ 2026-01-19 19:16:39.22008-07 sql_1 finished SET SORTED BY   │
└─────────┴─────────────────────────────────────────────────────────────┘


┌─────────┬─────────────────────┐
│ process │    sum("range")     │
│ varchar │       int128        │
├─────────┼─────────────────────┤
│ sql_2   │ 1999999999000000000 │
└─────────┴─────────────────────┘
┌─────────┬─────────────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_2 finished adding column') │
│ varchar │                           varchar                           │
├─────────┼─────────────────────────────────────────────────────────────┤
│ sql_2   │ 2026-01-19 19:16:35.416129-07 sql_2 finished adding column  │
└─────────┴─────────────────────────────────────────────────────────────┘
┌─────────┬──────────────────────┐
│ process │     sum("range")     │
│ varchar │        int128        │
├─────────┼──────────────────────┤
│ sql_2   │ 40499999995500000000 │
└─────────┴──────────────────────┘
┌─────────┬───────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_2 about to compact') │
│ varchar │                        varchar                        │
├─────────┼───────────────────────────────────────────────────────┤
│ sql_2   │ 2026-01-19 19:16:46.373997-07 sql_2 about to compact  │
└─────────┴───────────────────────────────────────────────────────┘
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
┌─────────┬───────────┬────────────┬────────────┬─────────┐
│ process │ unique_id │ sort_key_1 │ sort_key_2 │  bonus  │
│ varchar │   int64   │   int64    │  varchar   │ varchar │
├─────────┼───────────┼────────────┼────────────┼─────────┤
│ sql_2   │         3 │          1 │ woot3      │ NULL    │
│ sql_2   │         2 │          0 │ woot2      │ NULL    │
│ sql_2   │         1 │          1 │ woot1      │ NULL    │
│ sql_2   │         0 │          0 │ woot0      │ NULL    │
│ sql_2   │         7 │          1 │ woot7      │ NULL    │
│ sql_2   │         6 │          0 │ woot6      │ NULL    │
│ sql_2   │         5 │          1 │ woot5      │ NULL    │
│ sql_2   │         4 │          0 │ woot4      │ NULL    │
└─────────┴───────────┴────────────┴────────────┴─────────┘
┌─────────┬─────────────┬────────────────┬────────────────────────────────────────────┐
│ process │ snapshot_id │ schema_version │                  changes                   │
│ varchar │    int64    │     int64      │          map(varchar, varchar[])           │
├─────────┼─────────────┼────────────────┼────────────────────────────────────────────┤
│ sql_2   │           0 │              0 │ {schemas_created=[main]}                   │
│ sql_2   │           1 │              1 │ {tables_created=[main.sort_on_compaction]} │
│ sql_2   │           2 │              1 │ {tables_inserted_into=[1]}                 │
│ sql_2   │           3 │              1 │ {tables_inserted_into=[1]}                 │
│ sql_2   │           4 │              2 │ {tables_altered=[1]}                       │
│ sql_2   │           5 │              2 │ {tables_altered=[1]}                       │
│ sql_2   │           6 │              2 │ {}                                         │
└─────────┴─────────────┴────────────────┴────────────────────────────────────────────┘
┌─────────┬──────────┬────────────────┬──────────────┬────────────────┬────────────┬────────────────┬────────────┐
│ process │ table_id │ begin_snapshot │ end_snapshot │ sort_key_index │ expression │ sort_direction │ null_order │
│ varchar │  int64   │     int64      │    int64     │     int64      │  varchar   │    varchar     │  varchar   │
├─────────┼──────────┼────────────────┼──────────────┼────────────────┼────────────┼────────────────┼────────────┤
│ sql_2   │        1 │              5 │         NULL │              0 │ sort_key_1 │ DESC           │ NULLS_LAST │
│ sql_2   │        1 │              5 │         NULL │              1 │ sort_key_2 │ DESC           │ NULLS_LAST │
└─────────┴──────────┴────────────────┴──────────────┴────────────────┴────────────┴────────────────┴────────────┘

Alex-Monahan added 30 commits October 4, 2025 21:33

Working hardcoded compaction sort!

eb59a29

parse order by string and manually bind.

f2b5947

approx_order_by param in merge_adjacent_files

856e479

config option for approx_order_by

3b52466

Refactor into a method on DuckLakeCompactor

3b24a22

Move compactor to .hpp, GetApproxOrderBy fn, WIP inlined order by

d670375

working tests for ducklake_flush_inlined_data!

b214ccd

Use const references. More commented out attempt to dynamically bind

295e7f5

rename parameter to local_order_by

e67d5e2

Rename fns, - comments/prints,+ negative test

87964df

Merge remote-tracking branch 'origin' into ordered-compaction

cf11b9b

Accept SET SORTED BY syntax (do nothing yet)

d12e237

SET SORTED inserts to DB! Lots of wiring...

62beb45

Working compact with sort_data instead of option! WIP cleanup.

41da46b

Update tests to use new syntax. Passing!

cf6c617

Inlining tests using new syntax working!

6bc926b

Remove local_order_by option and naming

d9e4e61

inline flush sorted within txn works. Not compaction (seems ok?). Tes…

f8b4425

…ting.

Revert rename of duckdb property

1449bdb

RESET SORTED BY works and is tested!

f55888b

If sorts match, don't insert to catalog

10d527d

Merge branch 'main' into ordered-compaction-catalog

f89e12a

Remove duplicate InsertSort fn

e3bf7a4

batch_queries for sort catalog operations

6b612cb

Remove comment block

a0a2424

retarget to duckdb main

4c13a28

try to point duckdb submodule to correct commit

61cc765

extension-ci-tools git commit hash fix

c988bed

Undo old ducklake_option edits. Edit comments

b73fc7d

Add FIXME for expressions in sort. (And re-run CI/CD)

a1a6b26

Alex-Monahan added 2 commits January 13, 2026 15:22

test that SET SORTED BY applies by table_id not by name

fb5dd0b

test that sort catalog tables cleared out by expire_snapshots

918ebf3

Alex-Monahan added 9 commits January 15, 2026 08:53

Don't update schema_version on SET SORTED BY. Track updates without s…

27cfa9a

…chema changes and keep schema cache up to date.

format fixes

7ae9384

Comments on tables, table columns, and views do not increase schema_v…

bd8c991

…ersion

format fix

58f23df

add test for snapshots() changes

79064e6

Fix error message

f17d7f2

Fix error message in test

dba148b

Fix another error message

fcca281

Tests: Rollbacks, sort & rename cols in same txn

062d00b

Alex-Monahan marked this pull request as ready for review January 16, 2026 03:44

Alex-Monahan marked this pull request as draft January 16, 2026 03:57

Alex-Monahan added 2 commits January 15, 2026 20:58

format fixes

cc4488f

Relax assertion: the table_entry_map contains tables and views.

f24ddc2

Alex-Monahan marked this pull request as ready for review January 16, 2026 04:19

std::string to string

c55a083
